Inferring the environment's statistical structure and adapting behavior
accordingly is a fundamental modus operandi of the brain. A
simple form of this faculty based on spatial attentional orienting can 
be studied with Posner's location-cueing paradigm in which a cue 
indicates the target location with a known probability. The present 
study focuses on a more complex version of this task, where the
probabilistic context (percentage of cue validity) changes unpredictably
over time, thereby creating a volatile environment. Saccadic 
response speed (RS) was recorded in 15 subjects and used to esti- 
mate subject-specific parameters of a Bayesian learning scheme 
modeling the subjects' trial-by-trial updates of beliefs. Different 
response models — specifying how computational states translate 
into observable behavior — were compared using Bayesian model se- 
lection. Saccadic RS was most plausibly explained as a function of 
the precision of the belief about the causes of sensory input. This 
finding is in accordance with current Bayesian theories of brain 
function, and specifically with the proposal that spatial attention is 
mediated by a precision-dependent gain modulation of sensory 
input. Our results provide empirical support for precision-dependent 
changes in beliefs about saccade target locations and motivate 
future neuroimaging and neuropharmacological studies of how 
Bayesian inference may determine spatial attention. 

Keywords: cue validity, hierarchical models, variational Bayes, visuospatial 
processing, volatility 

Introduction 

Prior beliefs about the location of a behaviorally relevant 
stimulus facilitate stimulus detection and speed reaction times 
(RTs). One of the first experimental demonstrations of this 
effect was provided by Posner's location-cueing paradigm 
(Posner 1980). In this task, a spatial cue (e.g., an arrow) indi- 
cates the most likely position of a behaviorally relevant target 
stimulus on a trial-by-trial basis. Average RTs are faster on 
valid trials — where the target appears at the expected or cued 
location — than on invalid trials, where target location is unex- 
pected. This reflects covert orienting of attention to the cued 
location in analogy to an attentional spotlight. Attentional or- 
ienting enhances information processing at the cued location 
at the expense of alternative (uncued) locations. 

However, there is accumulating evidence that attentional or- 
ienting in response to the spatial cue is not an all-or-none 
phenomenon, but is critically affected by trial history and by 
the current probabilistic context. For example, RT costs of 
invalid cueing are larger after a valid than after an invalid trial 
(Jongen and Smulders 2007), and RTs to invalid targets increase with
the number of preceding valid trials (Vossel et al. 2011). Moreover,
the RT difference between invalid and valid trials grows with the
proportion of validly cued trials (percentage of cue validity [%CV];
Jonides 1980; Eriksen and Yeh 1985; Giessing et al. 2006; Risko and
Stolz 2010).
These results imply that subjects infer and predict the current 
probabilistic context and adjust their behavior accordingly. 

The behavioral effects observed in Posner's location-cueing 
paradigm can be interpreted within recent theoretical frame- 
works of perception and attention based on Bayesian prin- 
ciples (e.g., Rao 2005; Friston 2009, 2010; Itti and Baldi 2009; 
Chikkerur et al. 2010; Feldman and Friston 2010). Here, the 
brain is considered as a Bayesian inference machine (e.g., 
Dayan et al. 1995; Friston 2009) which maintains and updates 
a generative model of its sensory inputs. In other words, per- 
ception can be framed as an "inverse problem": under a 
specific generative model, the current state of the world has to 
be inferred from the noisy signals conveyed by the sensorium. 
Notably, even when stimuli are presented with a very high 
signal-to-noise ratio, there are many aspects about the state of 
the world (i.e., the cause of sensory inputs) that are nontrivial 
to infer, such as its probabilistic structure (the "laws" that 
relate causes of stimuli to each other) or nonlinear interactions 
among causes (e.g., visual occlusion). The overall goal of this 
architecture is to minimize surprise about sensory inputs and 
thus underwrite homeostasis — either by updating model- 
based predictions or by eliciting actions to sample the world 
according to prior expectations. Notably, because surprise 
about sensory inputs cannot be evaluated directly, it has been 
proposed that perception and action optimize a free-energy 
bound on surprise (Friston et al. 2006; Friston 2009, 2010). 
Based on this free-energy principle, simulations have demon- 
strated how spatially selective attention can be understood as a 
function of precision (confidence or inverse uncertainty) 
during perceptual inference: attentional selection serves to in- 
crease the precision of sensory channels, enabling faster 
responses to attended stimuli (Feldman and Friston 2010). 
Physiologically, this attentional effect may be mediated by an 
increase in the synaptic gain of neuronal populations encod- 
ing prediction error. These populations are assumed to project 
to higher level units in the visual hierarchy where faster 
changes in neuronal activity are engendered in the context of 
higher precision (for details, see Feldman and Friston 2010). 

An important aspect of Posner's location-cueing task relates 
to the trial-by-trial uncertainty about the predictive value of 
the spatial cue (i.e., the probability that the target appears at 
the cued location in a given trial) (cf. Yu and Dayan 2005). 
This becomes particularly important in volatile environments,
where the cue predicts the target location with varying prob- 
abilities over the course of the experiment — in other words, 
situations in which probabilistic context changes unpredicta- 
bly over time. Here, the estimate (representation) of this 
probability — which we will operationalize in terms of %CV — 
depends on the integration of information over past events. 

A simple description of trial-by-trial learning of cue-target 
contingencies is provided by reinforcement learning models 
such as Rescorla-Wagner (Rescorla and Wagner 1972). In 
these models, the update of the probability estimate (in our 
case, the probability that the target will appear in the cued 
hemifield) is the product of a fixed learning rate and a predic- 
tion error (i.e., the difference between observed and pre- 
dicted outcome). The learning rate determines the impact of 
the prediction error on the belief update and, at the same 
time, determines to what extent the current belief is affected 
by past events. In other words, it determines the influence of 
previous trials (cf. Rushworth and Behrens 2008). 
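
The fixed-learning-rate update described above can be sketched in a few lines. This is a minimal illustration rather than the model fitted in this study; the function name, the starting estimate of 0.5, the learning rates, and the trial sequence are all assumptions chosen for the example.

```python
def rescorla_wagner(outcomes, learning_rate=0.1, initial_estimate=0.5):
    """Return the probability estimate held *before* each trial.

    outcomes: sequence of 1 (valid trial) and 0 (invalid trial).
    """
    estimate = initial_estimate
    predictions = []
    for outcome in outcomes:
        predictions.append(estimate)
        prediction_error = outcome - estimate          # observed minus predicted
        estimate += learning_rate * prediction_error   # fixed learning rate
    return predictions


# Hypothetical sequence: 20 valid trials followed by 20 invalid trials.
trials = [1] * 20 + [0] * 20
slow = rescorla_wagner(trials, learning_rate=0.05)
fast = rescorla_wagner(trials, learning_rate=0.5)
```

A high learning rate tracks the change in contingency quickly but makes the estimate depend almost entirely on the last few trials; a low learning rate integrates over a longer history but adapts slowly. No single fixed value suits both stable and volatile phases.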

While the Rescorla-Wagner rule describes a variety of 
human and animal behaviors, it is a heuristic approach that 
does not follow from principles of probability theory. More- 
over, it suffers from some practical limitations that might be 
overcome by the application of Bayesian principles (Gersh- 
man and Niv 2010). For associative learning paradigms, hier- 
archical Bayesian learning models provide a principled 
prescription of how beliefs are updated optimally in the pres- 
ence of new data. These models may provide a more plausible 
account of behavior than the Rescorla-Wagner rule, particu- 
larly in volatile environments where a fixed learning rate is 
suboptimal (Behrens et al. 2007; den Ouden et al. 2010). 

Recently, a generic hierarchical, approximately Bayes- 
optimal learning scheme was introduced that grandfathers 
and extends existing normative models (Mathys et al. 2011). 
This model uses a variational approximation to the optimal 
Bayesian solution. This approximation results in analytical 
update equations that 1) minimize free energy, 2) are extre- 
mely fast to evaluate, 3) contain parameters allowing for indi- 
vidual differences in learning, and 4) directly express the 
crucial role of prediction errors (and their weighting by un- 
certainty) that play such a prominent role in predictive coding 
schemes based on the free-energy principle described above. 
Crucially, this Bayesian scheme can be applied to empirical 
behavioral data, allowing one to compare different models of 
subject responses and quantify their trial-by-trial estimates of 
states of the environment that lead to sensory predictions, in- 
cluding the precision of these estimates. This enables formal 
tests of free-energy-based accounts of attention using empiri- 
cally observed behavior that complements simulation work 
(e.g., Feldman and Friston 2010). In particular, one can estab- 
lish the aspects of a Bayesian learning model that are most 
influential in determining response speed (RS). While one 
might hypothesize a relationship between precision and RS in 
the present attentional cueing task (or even more generally; 
see, e.g., Whiteley and Sahani 2008), other studies (employ- 
ing different experimental paradigms) have shown that 
RTs can be related to the (log) probability estimate per se 
(Carpenter and Williams 1995; Anderson and Carpenter 2006; 
Brodersen et al. 2008; den Ouden et al. 2010), or to the 
amount of surprise that is associated with a particular stimu- 
lus (Bestmann et al. 2008). Here, we try to explain observed 
responses under these different assumptions. To this end, we
formulate competing models that embody different assumptions and
formally compare their evidence using Bayesian
model selection (BMS). Practically, in contrast to RTs, RS tend 
to have a Gaussian distribution (Carpenter and Williams 1995; 
Brodersen et al. 2008) and provide a better-behaved response 
measure for modeling. 

In particular, we here apply this hierarchical Bayesian learning
model to saccadic RS data from a variant of Posner's
location-cueing paradigm with changes of probabilistic 
context (%CV) that are unknown to the subject. Saccadic eye 
movements and covert spatial attention are closely related and 
share a common functional neuroanatomy (Corbetta et al. 
1998; Nobre et al. 2000; Perry and Zeki 2000; Beauchamp 
et al. 2001; de Haan et al. 2008). There is strong evidence that 
eye movements to a given location are inevitably preceded by 
covert attention shifts to this location, enhancing local percep- 
tual processing (e.g., Deubel and Schneider 1996; Godijn and 
Theeuwes 2003; Dore-Mazars et al. 2004; Deubel 2008). The 
"premotor theory of attention" (Rizzolatti et al. 1987) states 
that attentional orienting may be functionally equivalent to 
saccade planning and initiation, and that therefore program- 
ming a saccade causes a shift of spatial attention. In a related 
theory, the "Visual Attention Model" (Schneider 1995), a 
single visual attention mechanism is proposed that controls 
both the selection for perception and the selection for action. 
Here, attention shifts are not caused by — but are a precondi- 
tion for — saccade preparation (Deubel 2008). The obligatory 
coupling between spatial attention and saccade programming 
is also evident in a recent computational model of evidence 
accumulation in the visuomotor cascade: visually responsive
neurons that can be found in the frontal eye fields (FEF), the
lateral intraparietal area, and superior colliculi (SC) provide
the source of drive for motor neurons in FEF and SC to elicit a 
saccade (Schall et al. 2011). 

Saccadic RS have been shown to be affected by the prob- 
ability of the saccade target location (Carpenter and Williams 
1995; Farrell et al. 2010; Chiau et al. 2011), and there is initial
evidence that trial-by-trial changes in saccadic RS reflect learn- 
ing of probabilistic context according to Bayesian principles 
(Anderson and Carpenter 2006; Brodersen et al. 2008). Ander- 
son and Carpenter (2006) presented 2 subjects with multiple 
trial blocks, in which targets initially appeared to the left and 
right side of fixation with equal probability. After 70-120 trials 
in each block, this probability could change abruptly, so that 
saccades were more likely to be made to one of the targets. By 
fitting an exponential function — modeling the trial-by-trial 
probability of the target location — the authors showed that 
saccadic RS is related to the learned prior probability of target 
appearance. Similarly, Brodersen et al. (2008) presented 3 sub- 
jects with blocks of left and right targets with different sto- 
chastic properties: the targets were either presented with 
different fixed probabilities, or the probability of the target 
location was conditional on the target location in the previous 
trial (first-order Markov sequence). They used 2 different 
learning models to ask whether the subjects learned and uti- 
lized the marginal probabilities of the target locations or their 
conditional probabilities (and thus a probability transition 
matrix). 

While both studies (Anderson and Carpenter 2006; Broder- 
sen et al. 2008) address the question of intertrial variability in 
probabilistic beliefs, they do not deal with the effects of the un- 
certainty (precision) of these beliefs, which have been formally 
implicated in spatial attention (Feldman and Friston 2010).
Moreover, both studies employed models that are agnostic 
about environmental volatility, thereby precluding the possi- 
bility that the subjects can adapt their learning rates, based on 
their current belief about the volatility of the environment. 

Here, we extend the previous findings in 2 ways. First, we 
show that trial-by-trial RS in the location-cueing paradigm can 
be explained as a function of the precision of trialwise beliefs, 
as inferred using hierarchical Bayesian inference (Mathys et al. 
2011). Second, our model accommodates individual learning 
processes by introducing subject-specific parameters that 
couple hierarchical levels and thus provides a novel quantifi- 
cation of, and explanation for, individual learning differences. 
In what follows, we will refer to the hierarchical Bayesian learning
model as the "perceptual model," because this model provides
a mapping from hidden states (or environmental causes)
to sensory inputs (Daunizeau, den Ouden, Pessiglione, Kiebel, 
Stephan et al. 2010; Daunizeau, den Ouden, Pessiglione, 
Kiebel, Friston et al. 2010). Furthermore, we will introduce
and compare different "response models" (Daunizeau, den 
Ouden, Pessiglione, Kiebel, Stephan et al. 2010; Daunizeau, 
den Ouden, Pessiglione, Kiebel, Friston et al. 2010) that de- 
scribe the mapping from the subject's probabilistic represen- 
tations (beliefs) — as provided by the perceptual model — to the 
observed responses (i.e., RS). 



Materials and Methods 

Subjects 

Sixteen healthy subjects gave written informed consent to participate 
in the current study. One subject had to be excluded from further 
analysis due to lack of fixation during the cue-target interval. There- 
fore, data from 15 subjects were analyzed (9 males, 6 females; age 
range from 23 to 35 years; mean age 27.4 years). All subjects were 
right-handed and had normal or corrected-to-normal vision. The
study had been approved by the local ethics committee (University 
College London). 



Stimuli and Experimental Paradigm 

We used a location-cueing paradigm with central predictive cueing
(Posner 1980). Stimuli were presented on a 19-inch monitor (spatial
resolution 1024 x 768 pixels, refresh rate 75 Hz) at a viewing distance
of 60 cm. On each trial, 2 peripherally located boxes were
shown (1.9° wide and 8° eccentric in each visual field, see Fig. 1) that
could contain target stimuli. A central diamond (0.65° eccentric in
each visual field) was placed between them, serving as a fixation
point. Cues consisted of a 200-ms brightening of one side of
the diamond, creating an arrowhead pointing to one of the peripheral
boxes. After a 1200-ms stimulus onset asynchrony (SOA), a target
appeared for 100 ms in one of the boxes. The targets were vertical
and horizontal circular sinusoidal gratings (1.3° visual angle). Vertical
and horizontal gratings were presented with equal probability.

Figure 1. Illustration of the experimental task and the manipulation of %CV over the
612 trials.

Subjects were instructed to maintain central fixation during the cue
period and to make a saccade to the target stimulus as fast as poss- 
ible. They were encouraged to blink and refixate the central fixation 
dot after the saccade. After a short practice session of 64 trials — with 
constant 88%CV — the experiment comprised 612 trials with blockwise 
changes in %CV that were unknown to the subjects. After half of the 
trials, the subjects had a short rest of 1 min. Each block with constant 
%CV contained an equal number of left and right targets, counterba- 
lanced across valid and invalid trials. %CV changed after either 32 or 
36 trials switching unpredictably to levels of 88%, 69%, or 50% (see 
Fig. 1). Subjects were told in advance that there would be changes in 
%CV over the course of the experiment, but were not informed about 
the levels of these probabilities or when they would change. Each 
subject was presented with the same sequence of trials. This is a stan- 
dard procedure in computational studies of learning processes that 
require inference on conditional probabilities in time series (cf. 
Behrens et al. 2007; Daunizeau, den Ouden, Pessiglione, Kiebel, 
Friston et al. 2010). In these situations, the parameters of the learning 
process depend on the exact sequence of trials used. Although this 
dependency will diminish asymptotically with increasing numbers of 
trials, for the relatively short sequences (of a few hundred trials at 
best) that are feasible within a standard experiment, introducing a 
different sequence for each participant could increase the variability 
of parameter estimates, over and above the intrinsic interindividual 
trait-differences per se. We therefore decided to keep trial sequence 
constant to ensure that differences in model parameters can be attrib- 
uted to subject-specific rather than task-specific factors. 



Eye Movement Data Recording and Analysis 

Participants sat in a dimly lit sound-proof cabin with their head stabil- 
ized by a chinrest. Eye movements were recorded from the right eye 
with an EyeLink 1000 desktop-mounted eye-tracker (SR Research Ltd)
with a sampling rate of 250 Hz. A 9-point eye-tracker calibration and 
validation was performed at the start of the experiment and after the 
pause in the middle of the experiment. The validation error was <1° 
of visual angle. 

Eye movement data were analyzed with MATLAB (Mathworks) and 
ILAB (Gitelman 2002). Blinks were filtered out and pupil coordinates 
within a time window of 20 ms around the blink were removed. Trials 
with >20% missing data were discarded from the analyses. To ensure 
central fixation after presentation of the spatial cue, the period 
between cue and target was analyzed for gaze deviations from the 
center. After target appearance, only the first saccade was analyzed. 
Saccades were identified when the eye velocity exceeded 30°/s (Fischer 
et al. 1993; Stampe 1993). After this threshold was reached, the begin- 
ning of the saccade was defined as the time when the velocity exceeded 
15% of the trial-specific maximum velocity (Fischer et al. 1993). Like- 
wise, the end of the saccade was defined as the time when the velocity 
fell below 15% of the trial-specific maximum velocity. Moreover, the 
saccade amplitude needed to subtend at least two-thirds of the distance 
between the fixation point and the actual target location. Saccadic RT 
was defined as the latency between target and saccade onset. Saccades 
in which the starting position was not within a region of 1° from the 
fixation point and saccades with a latency <90 ms were discarded from 
the analyses. Our analyses focused on inverse RTs (i.e., RS) since, in 
contrast to RTs, RS are normally distributed (cf. Carpenter and Williams 
1995; Brodersen et al. 2008). 
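
The onset criteria and the conversion from RT to RS can be sketched as follows. This is one possible reading of the procedure, not the authors' MATLAB/ILAB code: the function names are hypothetical, the input is assumed to be an eye-speed trace in °/s that starts at target onset, the onset is taken as the first sample at or after the 30°/s crossing that exceeds 15% of the peak velocity, and RS is expressed in 1/s (the unit is an assumption).

```python
def saccade_latency_ms(velocity, sample_rate_hz=250.0,
                       detect_threshold=30.0, onset_fraction=0.15):
    """Latency (ms) of the first saccade in a speed trace, or None.

    velocity: eye speed in deg/s, one value per sample, starting at target onset.
    """
    # Step 1: a saccade is detected once speed exceeds the 30 deg/s threshold.
    crossing = next((i for i, v in enumerate(velocity) if v > detect_threshold),
                    None)
    if crossing is None:
        return None
    # Step 2: onset is where speed exceeds 15% of the trial-specific maximum.
    criterion = onset_fraction * max(velocity)
    onset = next(i for i in range(crossing, len(velocity))
                 if velocity[i] > criterion)
    return 1000.0 * onset / sample_rate_hz


def response_speed(latency_ms, min_latency_ms=90.0):
    """Inverse RT in 1/s; None for missing or anticipatory (<90 ms) saccades."""
    if latency_ms is None or latency_ms < min_latency_ms:
        return None
    return 1000.0 / latency_ms


# Made-up trace: 200 ms of fixation noise, then a saccade-like velocity pulse.
trace = [5.0] * 50 + [20.0, 80.0, 300.0, 400.0, 250.0, 60.0, 10.0] + [5.0] * 10
latency = saccade_latency_ms(trace)
rs = response_speed(latency)
```

The amplitude and starting-position checks described above are omitted here; they would be additional filters applied before a trial's RS is accepted.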

To assess the effect of probabilistic context (true %CV), mean RS for 
each subject and for each %CV condition were entered into a 2 (cue: 
valid, invalid) x 3 (%CV: 50, 69, 88%) within-subjects analysis of var- 
iance (ANOVA). In this analysis, evidence for an impact of probabilistic 
context would be reflected in a significant cue x %CV interaction effect,
with increasing differences between valid and invalid RS at higher
%CV. Results from this analysis are reported in the Results section at a
significance level of P < 0.05 after Greenhouse-Geisser correction.
Condition-specific mean RS was also calculated separately for the 2
halves of the experiment and analyzed with a 2 (cue: valid, invalid) x 3
(%CV: 50, 69, 88%) x 2 (time: first half, second half) within-subjects
ANOVA (note that each %CV condition was presented 3 times in each
half, cf. Fig. 1).
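
As a small illustration of the quantities entering this ANOVA, the sketch below computes per-cell mean RS for one subject and the validity effect (valid minus invalid RS) at each %CV level. The helper names and the RS values are made up for the example; the ANOVA itself is not implemented here.

```python
from collections import defaultdict
from statistics import mean


def condition_means(trials):
    """Mean RS per (cue, %CV) cell for one subject.

    trials: iterable of (cue, cv, rs), cue in {"valid", "invalid"};
    rs is None for discarded trials.
    """
    cells = defaultdict(list)
    for cue, cv, rs in trials:
        if rs is not None:
            cells[(cue, cv)].append(rs)
    return {cell: mean(values) for cell, values in cells.items()}


def validity_effect(means, cv):
    """Valid minus invalid mean RS at one %CV level."""
    return means[("valid", cv)] - means[("invalid", cv)]


# Hypothetical data for one subject (RS in 1/s).
example = [("valid", 50, 5.0), ("invalid", 50, 4.8),
           ("valid", 88, 5.6), ("valid", 88, 5.4),
           ("invalid", 88, 4.4), ("invalid", 50, None)]
means = condition_means(example)
```

A cue x %CV interaction corresponds to `validity_effect` growing with `cv`, as it does for these invented numbers.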

Having established the significance of the experimental effects, we 
then sought to model them in terms of hierarchical Bayesian updating: 



Perceptual Model 

In what follows, we briefly outline the generative perceptual model 
(for details on the mathematical derivation of the update equations 
see Appendix section and Mathys et al. 2011). The perceptual model 
(dark gray panel in Fig. 2) comprises a hierarchy of 3 hidden states 
(denoted by x), with states 2 and 3 evolving in time as Gaussian 
random walks. 

The probability of a target appearing at the cued location in a
given trial $t$ (represented by the state $x_1^{(t)}$, with $x_1 = 1$ for valid and
$x_1 = 0$ for invalid targets) is governed by a state $x_2$ at the next level of
the hierarchy. (Note that in this particular experiment the target
stimulus was visible without any ambiguity [very high signal-to-noise
ratio]; this means there is a simple deterministic mapping between
the [mean of] $x_1$ and input $u$ of the general model, which allows for
situations with perceptual ambiguity [e.g., visual noise].) $x_2$ is a real
number, and the probability distribution of $x_1$ given $x_2$ is described
by a logistic sigmoid (softmax) function, so that the states $x_1 = 0$ and
$x_1 = 1$ are equally probable when $x_2 = 0$:

$$p(x_1^{(t)} \mid x_2^{(t)}) = s(x_2^{(t)})^{x_1^{(t)}} \left(1 - s(x_2^{(t)})\right)^{1 - x_1^{(t)}} \qquad (1)$$

with $s(x) \overset{\text{def}}{=} 1/(1 + e^{-x})$ and $x_1 \in \{0, 1\}$.

Hence, in the current location-cueing paradigm, $x_2$ determines the
trial-specific estimate for %CV. The state $x_2$ itself changes
over time (trials) as a Gaussian random walk, so that the value $x_2^{(t)}$
will be normally distributed around $x_2^{(t-1)}$ from the previous trial,
with the variance of the distribution described by the term $e^{x_3^{(t)} + \omega}$.
(This is a simplified version of the full model in Mathys et al. 2011, in
which a scaling parameter ($\kappa$) has been set to 1.)

$$p(x_2^{(t)}) \sim \mathcal{N}\left(x_2^{(t-1)},\, e^{x_3^{(t)} + \omega}\right). \qquad (2)$$




Figure 2. Graphical illustration of the perceptual (generative) model with states $x_1$,
$x_2$, and $x_3$. The model parameters $\omega$ and $\vartheta$ impact on the time course of subjects'
inferred belief about the states $x$ and are estimated from the individual subject RS
data. Circles represent constants, while diamonds represent quantities that change in
time (i.e., that carry a time [or trial] index). Hexagons, like diamonds, represent
quantities that change in time but that additionally depend on their previous state in
time in a Markovian fashion.



Changes in $x_2$ over time (trials) are thus determined by the quantities
$x_3$ (the third level of the hierarchy) and a subject-specific parameter $\omega$
that allows for individual differences in the updating of $x_2$. Accordingly,
$x_3$ and $\omega$ can be regarded as state-dependent and subject-specific
(trait-like) measures of log-volatility (trial-by-trial variability
in $x_2$), respectively. The state $x_3^{(t)}$ (Fig. 2) on a given trial is normally
distributed around $x_3^{(t-1)}$ with a variance determined by the constant
subject-specific parameter $\vartheta$. The parameter $\vartheta$ is a measure of
meta-volatility (volatility of volatility) that determines the variability of the
log-volatility over time.

$$p(x_3^{(t)}) \sim \mathcal{N}\left(x_3^{(t-1)},\, \vartheta\right). \qquad (3)$$
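
To make the three-level generative model concrete, the sketch below samples one trajectory from it: a Gaussian random walk at the third level, a second-level random walk whose step variance is the exponential of the third-level state plus a constant, and a Bernoulli outcome at the first level obtained through the logistic sigmoid. The function names, the starting values of zero, and the parameter values are assumptions made for illustration; they are not estimates from this study.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function s(x) = 1 / (1 + exp(-x))."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def simulate_hierarchy(n_trials, omega, theta, seed=0):
    """Sample (x1, x2, x3) for each trial from the generative model."""
    rng = random.Random(seed)
    x2, x3 = 0.0, 0.0                      # assumed starting values
    states = []
    for _ in range(n_trials):
        # Level 3: log-volatility random walk with variance theta.
        x3 = rng.gauss(x3, math.sqrt(theta))
        # Level 2: random walk with variance exp(x3 + omega).
        x2 = rng.gauss(x2, math.sqrt(math.exp(x3 + omega)))
        # Level 1: valid (1) or invalid (0) with P(valid) = s(x2).
        x1 = 1 if rng.random() < sigmoid(x2) else 0
        states.append((x1, x2, x3))
    return states


# Hypothetical parameter values, 612 trials as in the experiment.
trajectory = simulate_hierarchy(612, omega=-4.0, theta=0.01)
```

Larger values of the constant in the exponent make the cue-validity state drift faster from trial to trial, whereas a larger third-level variance makes the drift rate itself less stable over time.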

To map from the sensory inputs to the probabilistic representations of
the subject, the perceptual model needs to be inverted to obtain posterior
densities on the hidden states $x$. In the following, the sufficient
statistics of the subject's posterior belief will be denoted by $\mu$ (mean)
and $\sigma$ (variance) or $\pi = 1/\sigma$ (precision). We use the hat symbol ($\hat{\ }$)
to denote predictions before the observation of $x_1$ on a given trial.
Variational inversion under a mean field approximation yields simple
analytical update equations, where belief updating rests on
precision-weighted prediction errors. The update of the posterior
mean at level $i$ in the hierarchy on trial $t$ has the following general
form (at the second level of the model in this study, the precision
weighting has a slightly different form, i.e., $\hat{\pi}_1^{(t)} / (\hat{\pi}_2^{(t)} \hat{\pi}_1^{(t)} + 1)$,
because of the sigmoid transform that relates the second level to the
first; see equation A2):

$$\Delta\mu_i^{(t)} \propto \frac{\hat{\pi}_{i-1}^{(t)}}{\pi_i^{(t)}} \, \delta_{i-1}^{(t)} \qquad (4)$$

In equation (4), $\hat{\pi}_{i-1}^{(t)}$ is the precision of the prediction about the state
at the level below and $\pi_i^{(t)}$ is the precision of the posterior belief
about the state at the current level, while $\delta_{i-1}^{(t)}$ is the prediction error
about the input from the level below. For the derivation of these
updates and their detailed form, see the Appendix section and Mathys
et al. (2011). In brief, these equations provide approximately
Bayes-optimal rules for the trial-by-trial updating of the representations
(beliefs) that determine the subject's estimate of the probability that
the target appears at the cued location on a particular trial. Note that
this is an individualized Bayes optimality, in reference to the
subject-specific values for the parameters $\omega$ (determining subject-specific
log-volatility) and $\vartheta$ (subject-specific meta-volatility).
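
The role of the precision weighting at the second level can be illustrated with a single-trial update. The sketch below is a simplification under stated assumptions: the precision of the second-level prediction is passed in as a fixed number (in the full scheme it depends on the previous posterior variance and on the volatility estimate), the first-level prediction precision is taken to be the inverse Bernoulli variance (which has its minimum of 4 at a predicted probability of 0.5), and the update is written as an equality using the second-level weighting quoted above. The function names are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def update_level2(mu2_prev, pihat2, outcome):
    """One-trial update of the second-level mean (logit of cue validity).

    mu2_prev: posterior mean after the previous trial.
    pihat2:   precision of the prediction about x2 (treated as given here).
    outcome:  1 for a valid trial, 0 for an invalid trial.
    Returns (new mean, weight applied to the prediction error).
    """
    muhat1 = sigmoid(mu2_prev)          # predicted probability of a valid trial
    delta1 = outcome - muhat1           # first-level prediction error
    # Equals pihat1 / (pihat2 * pihat1 + 1) with pihat1 = 1 / (muhat1 * (1 - muhat1)),
    # written this way to avoid dividing by zero for extreme beliefs.
    weight = 1.0 / (pihat2 + muhat1 * (1.0 - muhat1))
    return mu2_prev + weight * delta1, weight


# Same prediction error, different confidence in the current belief.
uncertain_mu2, uncertain_weight = update_level2(0.0, pihat2=1.0, outcome=1)
confident_mu2, confident_weight = update_level2(0.0, pihat2=10.0, outcome=1)
```

When the belief about the second-level state is imprecise (small `pihat2`), the same prediction error moves the estimate much further than when the belief is precise, which is why the weight behaves like a learning rate that changes from trial to trial.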

It is interesting to note that the general update equations (4) 
arising from the variational hierarchical Bayesian scheme are formally 
similar to reinforcement learning models such as the Rescorla- 
Wagner rule (Rescorla and Wagner 1972). As described in detail in 
Mathys et al. (2011), the precision weighting of the updates at the 
second level can be understood as a time-varying learning rate, which 
varies with the state-dependent component $\mu_3$ of the log-volatility
(see Appendix section for details). An alternative — but equally useful — 
perspective on the generic update scheme in equation (4) is in terms 
of Bayesian filtering, for example, Kalman filtering. The Kalman filter 
can be regarded as an extension of the Rescorla-Wagner rule. It for- 
malizes the predictive relationship between events, but also comprises 
expectations about how this relationship is expected to change over 
time and takes into account the uncertainty about this prediction 
(Dayan 2000). In this context, the precision-dependent weighting of 
prediction errors in our scheme corresponds to the
Kalman gain that is applied to prediction errors to provide optimal pre- 
dictions about the future. These perspectives on precision (reinforce- 
ment learning rates and Kalman gain) illustrate the formal equivalence 
between reinforcement learning, predictive coding, and Bayesian filter- 
ing, disclosed by the general scheme used here. 
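
The correspondence between a learning rate and the Kalman gain can be seen in a scalar filter. The sketch below is the textbook one-dimensional Kalman filter for a random-walk state observed with Gaussian noise, not the hierarchical scheme used in this study; the function name and the noise variances are illustrative assumptions.

```python
def kalman_step(mean, variance, observation, process_var, obs_var):
    """One predict-and-update cycle of a scalar Kalman filter.

    The hidden state is assumed to follow a random walk with variance
    process_var and to be observed with noise variance obs_var.
    """
    prior_var = variance + process_var               # prediction step
    gain = prior_var / (prior_var + obs_var)         # Kalman gain
    new_mean = mean + gain * (observation - mean)    # gain times prediction error
    new_var = (1.0 - gain) * prior_var               # posterior uncertainty
    return new_mean, new_var, gain


# Start from an uncertain belief and observe the same value repeatedly.
mean, variance = 0.0, 1.0
gains = []
for observation in [1.0] * 30:
    mean, variance, gain = kalman_step(mean, variance, observation,
                                       process_var=0.01, obs_var=1.0)
    gains.append(gain)
```

The update has the Rescorla-Wagner form (estimate plus a weight times the prediction error), but the weight is recomputed on every trial from the current uncertainty: it is large while the belief is imprecise and settles at a lower value as evidence accumulates.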

In addition to the full hierarchical Bayesian model, we employed 2 
reduced versions of the perceptual model. This allowed us to evaluate 
whether the relatively complex hierarchical model was needed to 
explain our subjects' behavior. Specifically, the full hierarchical model 
assumes that 1) subjects are capable of learning the hierarchical struc- 
ture of the probabilities in this experiment and 2) exploit this 
knowledge to dynamically adapt the speed at which they update
beliefs (i.e., learning rate) by using precision-weighted prediction 
errors. Although these assumptions are theoretically well founded (cf. 
Mathys et al. 2011), it needs to be shown that equivalent explanations 
of the data could not be afforded by simpler, nonhierarchical learning 
models. Therefore, we specified 2 alternative perceptual Bayesian 
models that eschewed assumptions about hierarchically structured 
learning, but in different ways. The first alternative model assumed 
that subjects ignored the instructions that the environment was volatile,
expecting negligible changes in log-volatility (third level): $\vartheta$ was
thus fixed to zero, and only $\omega$ was estimated. The second perceptual
model did not use estimates of environmental volatility to adapt learning.
In this model, the influence of $x_3$ on the variance of $x_2$ was therefore
fixed to zero (cf. eq. 2), so that levels 2 and 3 of the model
became decoupled and rendered the values at the third level of the
model irrelevant (an equivalent effect is obtained by fixing $x_3^{(t)}$ to
zero).



Crucially, the 3 competing response models differ in how they 
specify the dependence of a on computational quantities from 
the perceptual model: these are precision, belief, and surprise 
about the sensory signal, respectively. All 3 models respected the 
same boundary conditions, i.e., a remained confined to the unit 
interval with a = 0.5 when (i^ = 0.5 (cf. Appendix section and 
Fig. 4). 

The first response model focused on the precision estimate at the 
first level of the perceptual model — following the recent proposal by 
Feldman and Friston (2010) concerning the role of precision for 
spatially selective attention in the location-cueing paradigm. Here, we 
assumed that on a given trial f, the attentional factor a''' was deter- 
mined by a sigmoid transformation is) of 'fr['\ the precision of the 
prediction at the first level, relative to its minimal value (i.e., 4 when 
Al = 0.5): 

-4). (6) 



Response Models 

To map from the subject's posterior beliefs to observed responses, 3 
different response models were compared. A detailed analysis and 
motivation of their functional forms can be found in the Appendix 
section. All response models predict inverse RT (RS), since the distri- 
bution of RS is typically normal, in contrast to RTs themselves (Car- 
penter and Williams 1995). Furthermore, all response models 
describe trialwise RS as a linear function of an attentional factor a, 
based on the posterior beliefs of the perceptual model. This factor 
can be regarded as the proportion of attentional resources allocated 
to the cued location (i.e., a is normalized to the unit interval): 

PS-/ ^i«.id+f2 for jci = 1 (i.e., valid trial), 

Ui,„..M+f2(l-«) forxi=0(i.e.,invalidtrial). 

Note that in all cases, RS is the same function of attentional resources 
allocated to the outcome location: on valid trials, this is the amount of 
attentional resources a allocated to the cued location, while — on 
invalid trials — it is the amount of attentional resources 1 — a allocated 
to the uncued location (cf. Fig. 3). Here, and ^2 

subject-specific parameters that are estimated from the data. Minimal 
and maximal RS for valid and invalid trials are then defined by 
fi.^d/,..a,>d and fi„,„_„„„,„ + ^2, respectively. 




Invalid 



0.5 1 
£1! 

Figure 3. Illustration of the relationship between RS (inverse RT) and the quantity $\alpha$,
representing the amount of attentional resources allocated to the cued location. For
each response model, RS were assumed to be linearly related to $\alpha$ (which differs
between the 3 models, see Appendix). Note the opposite behaviour of RS for
increasing $\alpha$ on valid (black line and equation) and invalid (gray line and equation)
trials (cf. eq. 5).



In the second response model, the "belief" model, the attentional
factor $\alpha$ depended on the strength of the prediction about CV:

$$\alpha^{(t)} = \hat{\mu}_1^{(t)} = s\left(\mu_2^{(t-1)}\right). \tag{7}$$

The third response model (surprise) was based upon the (Shannon)
surprise associated with the target stimulus. The Shannon surprise
(Shannon 1948) is the negative logarithm of a probability (here $\hat{\mu}_1^{(t)}$).
This response model was inspired by a previous study on cueing of
motor responses in which RTs were examined in relation to trialwise
surprise (Bestmann et al. 2008). Here, we defined $\alpha$ as a nonlinear
function of Shannon surprise:

$$\alpha^{(t)} = \frac{1}{1 + \mathrm{surprise}\left(\hat{\mu}_1^{(t)}\right)}$$

with $\mathrm{surprise}\left(\hat{\mu}_1^{(t)}\right) = -\log_2 p\left(x_1^{(t)} = 1 \mid \hat{\mu}_1^{(t)}\right) = -\log_2\left(\hat{\mu}_1^{(t)}\right)$.
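The three mappings from the predicted cue validity to the attentional factor, and the linear mapping from the attentional factor to RS, can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code. It assumes that the first-level prediction precision takes the Bernoulli form $1/(\hat{\mu}_1(1-\hat{\mu}_1))$, which is consistent with the stated minimum of 4 at $\hat{\mu}_1 = 0.5$; all function and argument names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic sigmoid s(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def alpha_precision(mu1_hat):
    """'Precision' model: sigmoid of the prediction precision minus its minimum (4).

    Assumption: precision of a Bernoulli prediction, 1 / (mu * (1 - mu)).
    Requires 0 < mu1_hat < 1.
    """
    pi1_hat = 1.0 / (mu1_hat * (1.0 - mu1_hat))
    return sigmoid(pi1_hat - 4.0)


def alpha_belief(mu1_hat):
    """'Belief' model: alpha equals the predicted probability of a valid cue."""
    return mu1_hat


def alpha_surprise(mu1_hat):
    """'Surprise' model: 1 / (1 + Shannon surprise in bits). Requires mu1_hat > 0."""
    return 1.0 / (1.0 - math.log2(mu1_hat))


def predicted_rs(alpha, valid, zeta1_valid, zeta1_invalid, zeta2):
    """Linear response model: RS depends on resources at the outcome location."""
    if valid:
        return zeta1_valid + zeta2 * alpha
    return zeta1_invalid + zeta2 * (1.0 - alpha)
```

At $\hat{\mu}_1 = 0.5$ all three functions return 0.5, so the models differ only in how quickly the attentional factor grows as the prediction becomes more extreme.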

In summary, we specified 3 alternative perceptual models and 3 
alternative response models. This resulted in a 3 x 3 factorial model 
space. We compared the relative plausibility of these models using a 
random effects BMS procedure at the group level, both for individual 
models and model families (Stephan et al. 2009; Penny et al. 2010). In 
addition, we compared these models to a standard Rescorla-Wagner 
learning model as well as to a model assuming that the true under- 
lying (categorical) probabilities were known to subjects — in other 
words, they did not have to be inferred on the basis of experience. In 
the latter 2 models, trialwise RS was supposed to be linearly related 
to the estimated or true %CV, respectively. 
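For comparison with the hierarchical scheme, the Rescorla-Wagner control model can be sketched as a delta rule with a single fixed learning rate. The sketch below is illustrative only: the starting value of 0.5 and the function name are assumptions, not details given in the text.

```python
def rescorla_wagner(outcomes, epsilon, v0=0.5):
    """Return the predicted cue validity held *before* each trial's outcome.

    outcomes: sequence of 1 (valid trial) or 0 (invalid trial).
    epsilon:  fixed learning rate in [0, 1].
    v0:       initial estimate (assumed here to be 0.5).
    """
    v = v0
    predictions = []
    for x in outcomes:
        predictions.append(v)
        # Delta rule: move the estimate toward the outcome by a fixed fraction.
        v = v + epsilon * (x - v)
    return predictions
```

Because the learning rate is constant, this learner cannot speed up its updating when the environment becomes more volatile, which is the key difference from the hierarchical Bayesian model.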

Estimation of the Model Parameters 

The perceptual model parameters $\omega$ and $\vartheta$, as well as the response
model parameters $\zeta_{1,\mathrm{valid}}$, $\zeta_{1,\mathrm{invalid}}$, and $\zeta_2$,
were estimated from the trialwise RS measures using variational Bayes. This enabled us to obtain
an estimate of the log model evidence for model comparison and to
evaluate the posterior densities of the model parameters. In short,
variational Bayes optimizes the (negative) free-energy $F$ as a lower
bound on the log-evidence, such that maximizing $F$ minimizes the
Kullback-Leibler divergence between exact and approximate posterior
distributions (for details, see Friston et al. 2007; Penny et al.
2007). MATLAB functions for the variational Bayes scheme were
derived from the DAVB toolbox (Daunizeau et al. 2009;
dl.dropbox.com/u/18527014/CODE/DAVB.zip). This approach is analogous to
the Bayesian inversion of Dynamic Causal Models for functional 
imaging or electrophysiological data (dynamic causal modeling 
[DCM], Friston et al. 2003; Daunizeau et al. 2011). 
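
The lower-bound property referred to above can be written out explicitly. The following is the standard variational identity rather than a formula taken from this section; here $q(\theta)$ denotes the approximate posterior over the parameters $\theta$, $y$ the data (trialwise RS), and $m$ the model.

```latex
\ln p(y \mid m) = F(q) + \mathrm{KL}\left[\, q(\theta) \,\|\, p(\theta \mid y, m) \,\right],
\qquad
F(q) = \mathbb{E}_{q}\left[ \ln p(y, \theta \mid m) \right] - \mathbb{E}_{q}\left[ \ln q(\theta) \right].
```

Because the Kullback-Leibler divergence is nonnegative, $F(q) \le \ln p(y \mid m)$. The log-evidence does not depend on $q$, so raising $F$ necessarily lowers the divergence between the approximate and the exact posterior, and the maximized $F$ serves as the approximation to the log-evidence used for model comparison.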

Like any Bayesian approach, variational Bayesian inversion requires
the definition of priors on the parameters. Importantly, the prior
(co)variance influences the estimability of parameters, e.g., their degree
of independence; also, by choosing a very small prior variance (very
high prior precision) one can effectively fix the value of a parameter.
Table 1 provides the priors used for inverting the full hierarchical
model. In the perceptual model, starting values for $\mu$ and $\sigma$ of states 2
and 3 were fixed and an upper bound of 1 was defined for the
parameter $\vartheta$. In the response model, the prior variance for $\zeta_2$, which
parameterizes the relationship between the attentional factor $\alpha$ and
RS (Fig. 3), was set to a fairly small value ($10^{-3}$). In other words, we
assumed that the relation between RS and $\alpha$ (see eq. 5) did not differ
greatly across subjects. In contrast, to account for individual baseline
differences in RS (i.e., the intercept of the linear relationship), the response
model parameters $\zeta_{1,\mathrm{valid}}$ and $\zeta_{1,\mathrm{invalid}}$ were given a larger prior variance,
allowing for substantial individual differences between subjects.

Figure 4. Illustration of the amount of attentional resources $\alpha$ for the 3 different theoretical response models as a function of $\hat{\mu}_1$.

Table 1
Prior mean and variance for the parameters of the perceptual and response models, and the
noise parameter

Parameter                        Prior mean    Prior variance

Perceptual model
  $\omega$                       -6            100
  $\vartheta$                    0.1           100

Response model
  $\zeta_{1,\mathrm{valid}}$     0.0052        0.1
  $\zeta_{1,\mathrm{invalid}}$   0.0052        0.1
  $\zeta_2$                      0.0006        0.001

Noise parameter
  $\zeta_3$                      0.001         1000

Note: $\vartheta$ is estimated in logit-space, while $\zeta_{1,\mathrm{valid}}$, $\zeta_{1,\mathrm{invalid}}$, and $\zeta_2$ are estimated in log-space.

While trials with missing responses did not contribute to parameter
estimation, they did contribute to estimating the evolution of the
states $x$, since they still provided the subject with an observation
about the cue-target contingency. In other words, we used what the
subject saw to compute the Bayes-optimal estimate of hidden states
over the experiment (under a particular set of parameters) and used
subject responses to optimize the parameters of the perceptual and
response models.



Bayesian Model Selection 

BMS evaluates the relative log-evidence (or log-marginal likelihood) 
of alternative models. The log-evidence of a model is the negative sur- 
prise about the data, given a model, and represents a generic trade-off 
between the accuracy and complexity of a model that can be derived 
from first principles of probability theory. Over the past decade, BMS 
has become a standard approach to assess the relative plausibility of 
competing models that describe how neurophysiological or behavior- 
al responses are generated (cf. Stephan et al. 2009; Daunizeau, den 
Ouden, Pessiglione, Kiebel, Stephan et al. 2010, Daunizeau, den 
Ouden, Pessiglione, Kiebel, Friston et al. 2010). Here, we use it to dis- 
ambiguate different hypotheses about how learning (as described by 
the perceptual models) and decision making (as described by the 
response models) evolve across and within trials. 

Above, we introduced 3 perceptual models and 3 response models 
("precision", "belief", and "surprise"). Combining these alternatives
provides 9 models in a 3 x 3 factorial model space, plus the additional 
2 control models (standard Rescorla-Wagner model and a model as- 
suming that the true probabilities were known to the subjects). To 
assess the relative plausibility of our models at the group level, we 
used random effects BMS (Stephan et al. 2009) and report both pos- 
terior probabilities and the exceedance probabilities of the competing 
models. Importantly, random effects BMS treats the model itself as 
being selected probabilistically by each subject in the population; i.e., 
as a random effect following a Dirichlet distribution. In brief, this 
enables group-level inference while accounting for interindividual 
differences (e.g., the optimal model can vary across subjects). Criti- 
cally, random effects BMS not only assesses the relative goodness of 
competing models but also quantifies (via the Dirichlet parameter 
estimates) the degree of heterogeneity in the sample studied (Stephan 
et al. 2009). 

The exceedance probability of a model is the probability that it is
more likely than any other model considered, given the data. For
example, an exceedance probability of 95% for a particular model
means that one has 95% confidence that this model has a greater
posterior probability than any other model tested (Stephan et al.
2009). Both posterior probabilities and exceedance probabilities sum
to unity over all models tested.
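
As an illustration of how exceedance probabilities follow from a Dirichlet posterior over model frequencies, the sketch below estimates them by Monte Carlo sampling. It is not the routine used in the study: the Dirichlet parameters passed in are hypothetical inputs (in practice they come from the random effects BMS procedure of Stephan et al. 2009), and the function name is an assumption.

```python
import random


def exceedance_probabilities(dirichlet_alpha, n_samples=20000, seed=0):
    """Estimate, for each model, the probability that it is the most frequent one.

    dirichlet_alpha: list of positive Dirichlet parameters, one per model.
    Returns a list of exceedance probabilities that sums to 1.
    """
    rng = random.Random(seed)
    wins = [0] * len(dirichlet_alpha)
    for _ in range(n_samples):
        # A Dirichlet draw is a vector of independent Gamma draws divided by
        # their sum; the division does not change which entry is largest,
        # so the unnormalized Gamma draws suffice for finding the winner.
        draws = [rng.gammavariate(a, 1.0) for a in dirichlet_alpha]
        wins[draws.index(max(draws))] += 1
    return [w / n_samples for w in wins]
```

For example, `exceedance_probabilities([10.0, 3.0, 3.0])` assigns most of the probability mass to the first model, and the returned values sum to one, as stated above for exceedance probabilities.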

Reproducibility of Results 

To examine the reproducibility and hence generalizability of our find- 
ings, we performed an additional analysis, using an independent set of 
subjects (n = 16, 8 males, 8 females; age range from 19 to 30 years;
mean age 23.4 years). Again, all subjects were right-handed and had
normal or corrected to normal vision. The subjects were tested as part 
of a separate psychopharmacological study employing a within-subject 
cross-over design. The data presented here were taken from the placebo 
session only, during which the subjects received a multivitamin tablet. 
This study was approved by the NHS Research Ethics Committee. 

The subjects were presented with exactly the same trial sequence 
as in the original study. The within-trial structure was also almost 
identical, with slight modifications to the timing of the task: the cue- 
target SOA was reduced to 800 ms and the target was presented for 
200 ms. Moreover, the trials were interspersed with 108 "null-trials" 
where only the baseline display (the fixation point and peripheral 
boxes) was shown. The task lasted 35 min and comprised 4 short rest 
periods. Finally, the subjects received a slightly longer training than 
the original group (one session with 100 trials with constant 80%CV 
and one session with 121 trials with changes in %CV). The same pro- 
cedures and analyses as outlined above were applied to the eye move- 
ment data, except that the data here were recorded with a sampling 
rate of 1000 Hz. Using trialwise RS, we again fitted the parameters of 
the perceptual and response models outlined above. 



Results 

Fixation During the Cue-Target Interval and Missing 
Trial Data 

Between the appearance of the cue and the target, the subjects
fixated the center of the display in 87.7 ± 2.3% (mean ± SEM) of
the trials within a region of interest of 1°, and in 95.4 ± 1.2%
of the trials within a region of 2° from the fixation point.
The proportion of trials with missing eye data or missing or
incorrect saccades amounted to 20.0 ± 3%, so that on average
80% of the trials (487 of 612 trials) were analyzed. Trials were
excluded from analysis because of anticipated responses
(3 ± 1%), incorrect or absent saccades (5 ± 1%), saccades not
starting from the fixation zone (8 ± 1%), or missing data points,
e.g., due to blinks (4 ± 1%). There was no significant difference
in the percentage of correct trials between the first and second
half of the experiment (paired t-test, P = 0.895).



Classical Inference About the Effects of Probability 
on RS

The 2 (cue: valid, invalid) × 3 (%CV: 50, 69, 88%) ANOVA on
RS data revealed a significant main effect of cue ($F_{1,14} = 8.8$,
P = 0.01) reflecting faster responses (higher RS) on valid than
on invalid trials. The main effect of %CV was not significant;
in other words, averaging over valid and invalid trials
removed any effect of probability. Crucially, we observed a
significant cue × %CV interaction effect ($F_{1.9,26.6} = 9.5$,
P = 0.001) reflecting a differential impact of %CV on valid and
invalid trials (Fig. 5). A separate analysis also considered
general trends in the data over time, e.g., due to fatigue, by
including time (first vs. second half of the experiment) as an
additional factor. This resulted in a 3-factorial cue (valid,
invalid) × %CV (50, 69, 88%) × time (first, second half)
ANOVA. Again, this analysis revealed a main effect of cue
($F_{1,14} = 8.2$, P = 0.013) and a significant cue × %CV interaction
($F_{1.6,22.5} = 10.5$, P = 0.001). The main effect of %CV was not
significant. Importantly, there was neither a significant main
effect of time nor interaction effects of the factor time with
any of the other factors (all P's > 0.4).

The cue x %CV interaction effect indicates a significant 
influence of probabilistic context on the subjects' responses, 
with stronger attentional orienting to the cue (and higher RT 
costs after invalid cueing) with higher %CV. However, 
Figure 5 does not show a strictly monotonic relationship 
between RS and true %CV for valid cues. This probably 
results from the fact that the underlying probabilistic structure 
(i.e., %CV) was unknown to the subjects and was changing in 
time fairly rapidly. It therefore had to be inferred by the sub- 
jects online, and these subject-specific and dynamic estimates 
should be the relevant predictors of observed RS, not %CV. In 
other words, the ANOVAs above (and the results in Fig. 5) 
average across trials that are heterogeneous in terms of 
subjective probability estimates, and a model predicting the 
subjective estimates should be superior in explaining behav- 
ior (cf. Fig. 9). In what follows, we test this hypothesis, 
asking whether the empirically observed RS might reflect 
trial-by-trial updating of the subjects' beliefs according to 
our Bayesian perceptual model. Additionally, we compare a 
systematic set of models that combine different putative learn- 
ing processes (perceptual models) with different ways in 
which the learned quantities drive behavior (response 
models). 




Figure 5. (A) Average RS in valid and invalid trials for the 3 (true) %CV levels. Error bars depict standard errors of the mean (SEM). (B) Illustration of how the observed RS costs
after invalid cueing translate into RT differences (in ms).
Bayesian Model Selection

Random effects BMS among the 3 perceptual model families 
(i.e., the full models and the 2 reduced model versions for 
each of the 3 response models) revealed that the full hierarch- 
ical Bayesian model had substantially higher model evidence 
than the 2 reduced (null) versions (Table 2). 

Comparing the 3 response model families (i.e., the pre- 
cision, belief and surprise models for each of the 3 versions 
of the perceptual model) showed that the response model 
based upon precision was clearly superior to the belief and 
the surprise model (Table 2). Finally, comparison of all 11 
individual models revealed that the full hierarchical Bayesian 
model combined with the precision response model was 
clearly superior to all other models we considered (Table 2, 
Supplementary Fig. 1). 



Parameters of the Winning Model 

The subject-specific values for log-volatility $\omega$ and meta-volatility
$\vartheta$ derived from the full hierarchical perceptual
model, based upon precision, are depicted in Figure 6A.
Figure 6B shows the minimal and maximal RS for each
subject as derived from the response model parameters $\zeta_{1,\mathrm{valid}}$,
$\zeta_{1,\mathrm{invalid}}$, and $\zeta_2$ in relation to the subject's overall (mean) RS.
The graph shows that there were considerable differences in
the absolute speed of responding across subjects, as parameterized
by averaged values for $\zeta_{1,\mathrm{valid}}$ and $\zeta_{1,\mathrm{invalid}}$, which were
estimated from the individual datasets.

Table 2
Results of the Bayesian model selection (BMS)

                                                Main dataset (n = 15)    Replication dataset (n = 16)
Model                                           PP        XP             PP        XP

Model family comparison: perceptual models
Full hierarchical Bayesian family               0.873     0.999          0.777     0.997
Reduced model family ($\vartheta = 0$)          0.064     <0.001         0.105     0.001
Reduced model family ($\kappa = 0$)             0.063     <0.001         0.118     0.002

Model family comparison: response models
"Precision" family                              0.756     0.991          0.642     0.930
"Belief" family                                 0.076     0.001          0.251     0.066
"Surprise" family                               0.168     0.008          0.107     0.004

Model comparison of all 11 models
Full hierarchical Bayesian model "Precision"    0.499     0.995          0.381     0.914
Reduced model ($\vartheta = 0$) "Precision"     0.006     <0.001         0.182     0.074
Reduced model ($\kappa = 0$) "Precision"        0.119     0.004          0.047     <0.001
Full hierarchical Bayesian model "Belief"       0.040     <0.001         0.041     <0.001
Reduced model ($\vartheta = 0$) "Belief"        0.040     <0.001         0.042     <0.001
Reduced model ($\kappa = 0$) "Belief"           0.040     <0.001         0.074     0.004
Full hierarchical Bayesian model "Surprise"     0.040     <0.001         0.079     0.004
Reduced model ($\vartheta = 0$) "Surprise"      0.040     <0.001         0.039     <0.001
Reduced model ($\kappa = 0$) "Surprise"         0.040     <0.001         0.039     <0.001
Rescorla-Wagner model                           0.040     <0.001         0.038     <0.001
True categorical probability model              0.040     <0.001         0.038     <0.001

In our hierarchical Bayesian scheme, the precision-weighting
$\hat{\pi}_1^{(t)} / (\hat{\pi}_2^{(t)} \hat{\pi}_1^{(t)} + 1)$ at the second level plays the role
of a (time-varying) learning rate that depends on the log-volatility,
determined by $\omega$ and $\mu_3$. As shown previously
(Mathys et al. 2011), this dependence on higher order
knowledge about change points in the environment enables
more adaptive learning in volatile environments, such as our
paradigm. This is also reflected by the BMS results described
above, where the hierarchical Bayesian model clearly outperformed
a standard Rescorla-Wagner model with a fixed
learning rate. However, given the formal similarity of the 2
models, one may expect to find a correlation between the
fixed learning rate of the Rescorla-Wagner model and the parameters
determining the learning rate of our hierarchical
Bayesian model. Figure 7 depicts this relationship between
the perceptual parameters $\omega$ and $\vartheta$, and the learning rate $\varepsilon$
derived from the Rescorla-Wagner model. While there was a
significantly positive correlation between the subject-specific
volatility estimate $\omega$ and learning rate $\varepsilon$ (r = 0.69; P = 0.004),
no relationship was observed between $\varepsilon$ and the meta-volatility
$\vartheta$ (P > 0.25) (Fig. 7).
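
The dependence of the second-level learning rate on the log-volatility can be made concrete with a single-trial sketch. It follows the update equations of the scheme in Mathys et al. (2011) as far as they are needed here, under loudly stated simplifications: the third-level estimate is held fixed within the trial, the coupling between levels (`kappa`) is assumed to be 1, and all names are hypothetical. It is meant as an illustration, not as the authors' implementation.

```python
import math


def sigmoid(x):
    """Logistic sigmoid s(x)."""
    return 1.0 / (1.0 + math.exp(-x))


def level2_update(mu2, sigma2, x, omega, mu3=0.0, kappa=1.0):
    """One update of the second-level belief after observing outcome x (1 or 0).

    mu2, sigma2: previous mean and variance of the belief (in logit space).
    omega:       subject-specific log-volatility parameter.
    mu3, kappa:  third-level estimate and coupling (assumed fixed here).
    Returns (new mean, learning rate); the learning rate is also the new variance.
    """
    mu1_hat = sigmoid(mu2)                                # predicted P(valid)
    pi1_hat = 1.0 / (mu1_hat * (1.0 - mu1_hat))           # prediction precision, level 1
    pi2_hat = 1.0 / (sigma2 + math.exp(kappa * mu3 + omega))  # prediction precision, level 2
    # Precision weighting from the text: pi1_hat / (pi2_hat * pi1_hat + 1).
    learning_rate = pi1_hat / (pi2_hat * pi1_hat + 1.0)
    mu2_new = mu2 + learning_rate * (x - mu1_hat)         # prediction-error update
    return mu2_new, learning_rate
```

With all else equal, a larger `omega` (or a larger `mu3`) lowers the second-level prediction precision and therefore raises the learning rate, whereas the Rescorla-Wagner model keeps this weight constant across trials.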

To illustrate different individual learning styles, Figure 8
shows the exemplary time courses of the third and first levels
of the Bayesian model for 2 subjects with distinct updating
behavior. The 2 subjects show differences in the volatility
estimate $\omega$ as well as the meta-volatility estimate $\vartheta$ (cf. Fig. 6,
where these subjects are indicated by stars). Although the
meta-volatility estimate $\vartheta$ is higher in subject A than in subject
B, subject B shows faster updating due to a higher volatility
estimate $\omega$. In other words, our model shows that the first
subject perceives the environment as substantially less volatile
than the second subject. As the updates of $\mu_2$ (the estimated
CV) are coupled to the estimated log-volatility $\mu_3$, this
translates into a higher learning rate and quicker updating
behavior in the second subject when the true underlying %CV
changes.
Figure 6. (A) Illustration of the subject-specific patterns for the values of the
volatility estimate $\omega$ and the meta-volatility estimate $\vartheta$. (B) Illustration of minimal and
maximal RS (as derived from the response model parameters $\zeta_1$ (averaged for valid
and invalid trials) and $\zeta_2$) in relation to overall (mean) RS. The symbols single and
double asterisks denote the data from subjects A and B depicted in Figure 8,
respectively.

Figure 7. Relationship between the perceptual parameters $\omega$ and $\vartheta$ and the
Rescorla-Wagner learning rate $\varepsilon$.



To illustrate how RTs are related to the precision-based attentional
factor $\alpha$, we pooled RS over different bins of the attentional
factor (using bins of 0.1, separately for valid and
invalid trials) using estimates of trial-specific $\alpha$ based on the
group average values for $\omega$ and $\vartheta$. Figure 9 depicts the binned
RS over subjects as a function of $\alpha$. A 2 (cue: valid, invalid) × 6
(precision-based quantity $\alpha$: 0.5, 0.6, 0.7, 0.8, 0.9, and 1.0)
ANOVA revealed a significant main effect of cue ($F_{1,14} = 11.8$,
P = 0.004) and a significant cue × $\alpha$ interaction effect
($F_{3.44,48.09} = 10.5$, P < 0.001). We compared these empirical RS
values to the RS predicted by the model. For this, we computed
the expected RS as a function of $\alpha$ on the basis of the
group average values for $\zeta_{1,\mathrm{valid}}$, $\zeta_{1,\mathrm{invalid}}$, and $\zeta_2$ (see Fig. 9). It
can be seen that the observed RS shows a pattern similar to
the predicted RS. As expected, as precision (confidence in the
validity of the cue) increases, there is an RT benefit for valid
trials and an equivalent cost for invalid trials. This illustrates
that one can explain attention formally in terms of optimizing
or learning the relative precision of competing sensory
channels.
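
The pooling step described above can be sketched as follows. The text specifies a bin width of 0.1 and separate pooling for valid and invalid trials; whether bins were centred on 0.5, 0.6, ..., 1.0 or bounded by these values is not stated, so the nearest-centre rule used here is an assumption, as are the function and argument names.

```python
from collections import defaultdict


def bin_rs_by_alpha(alphas, rs_values, valid_flags, bin_width=0.1):
    """Average RS within bins of the attentional factor, per trial type.

    alphas:      trial-specific attentional factors.
    rs_values:   trial-specific response speeds (inverse RT).
    valid_flags: True for valid trials, False for invalid trials.
    Returns {(trial_type, bin_centre): mean RS}.
    """
    totals = defaultdict(lambda: [0.0, 0])
    for alpha, rs, valid in zip(alphas, rs_values, valid_flags):
        # Assign the trial to the nearest bin centre (assumed binning rule);
        # the outer round removes floating-point noise such as 0.6000000000000001.
        centre = round(round(alpha / bin_width) * bin_width, 10)
        key = ("valid" if valid else "invalid", centre)
        totals[key][0] += rs
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

The resulting bin means per subject are the kind of values that would enter a cue × $\alpha$ ANOVA and be compared with the RS predicted from the response model parameters.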



Reproducibility of Results 

In the independent replication, the proportion of trials with
missing eye data or missing or incorrect saccades amounted
to 7.6 ± 2%, so that on average 92.4% of the trials (566 of 612
trials) were analyzed. Excluded trials were due to anticipated
responses (0.6 ± 0.2%), incorrect or absent saccades
(0.6 ± 0.2%), saccades not starting from the fixation zone
(3.3 ± 1%), or missing data points, e.g., due to blinks
(3.1 ± 1%). Note that due to the extended training and the
increased number of resting periods, the number of usable
trials was higher than in the original study.

The 2 (cue: valid, invalid) × 3 (%CV: 50, 69, 88%) ANOVA
on RS data gave the same results as for the original dataset.
Specifically, it revealed a significant main effect of cue
($F_{1,15} = 17.6$, P = 0.001) reflecting faster responses (higher RS)
on valid than on invalid trials. As before, the main effect of
%CV was not significant but we observed a significant cue ×
%CV interaction effect ($F_{1.99,29.88} = 4.7$, P = 0.017). As the data
were derived from a within-subject cross-over design (where
half of the subjects received the placebo tablet in the first
session, while the placebo session for the other half of subjects
was the second experimental session), we additionally
tested for an effect of session order by adding this variable as
a between-subject factor to the ANOVA. No main effect of
session order (P = 0.15) or interaction of session order with
any of the other factors (all P > 0.28) was observed.

The results of the Bayesian model comparison are shown 
in Table 2. Again, the full Bayesian model based upon pre- 
cision showed the highest exceedance probability (0.914) 
when compared with alternative models. For the winning 
model, we again observed a significant positive correlation
between the Rescorla-Wagner learning rate $\varepsilon$ and $\omega$ (r = 0.59,
P = 0.017), while no such relationship was observed between
$\varepsilon$ and $\vartheta$ (P = 0.97). In summary, this second dataset provided
a full replication of our original results.

Discussion 

The present study analyzed saccadic RTs in a location-cueing 
paradigm with a volatile probabilistic context, probing Baye- 
sian theories of perceptual inference. Extending previous 
theoretical work (Feldman and Friston 2010), we were able to 
provide empirical evidence for the free-energy formulation of 
attention in the context of a Posner paradigm — where CV 
changed unpredictably in time, thus requiring the subject to 
learn about environmental volatility. Specifically, using a 
generic hierarchical Bayesian scheme (Mathys et al. 2011), we 
compared 3 alternative models of how subjects might update 
estimates of CV across trials (perceptual models) and crossed 
these with 3 alternative hypotheses about how posterior 
beliefs (precision, belief, and surprise) might inform decision 
making within trials (response models). The resulting 9 
models — and 2 control models — were optimized using empiri- 
cal measures of saccadic RS and their relative plausibility was 
evaluated using BMS. The results of this model comparison 
provided strong evidence in favor of the hierarchical Bayesian 
model combined with the precision response model (Table 2) 
and this finding was replicated in an independent dataset. This 
supports the notion that attention can be formulated as opti- 
mizing the confidence in (or precision of) the inference on 
sensory input (Friston 2009). In the following, we examine 
our results in more detail, discuss them in the context of pre- 
vious work, and outline future extensions. 

Our experimental paradigm differed from a conventional 
Posner task, in that the spatial cues predicted the target 
location with different probabilities at different times during 
the experiment, thus requiring the subject to infer CV while 
accounting for environmental volatility. Indeed, a convention- 
al ANOVA showed that the subjects' RS varied as a function 
of the (unknown) true probabilities, reflecting adaptation 
to the changing environmental statistics. In other words,
probabilistic context significantly influenced saccadic
latencies, although the probabilistic structure of the task was
changing in a way that was unknown to subjects.

This relates to previous work in so far as it has been shown 
that (inverse) saccadic RTs are sensitive to the probability of 
the saccade target location when abrupt changes in location 
probability occur within an experimental block (Anderson 
and Carpenter 2006), or when different blocks employ 
saccade targets with different probabilities and/or stochastic 
properties (Brodersen et al. 2008). In contrast to our task, 
both these studies presented targets without preceding cues, 
and the latter study also examined learning of sequential (con- 
ditional) dependencies between successive stimuli according 
to a first-order Markov sequence. The present task used expli- 
cit cues to elicit spatial attention shifts and investigated how 
the impact of these cues depended on the subject's current 
belief (and its precision) about the cue-target contingency. 
Moreover, instead of presenting different experimental blocks 
with different probabilistic contexts, here we introduced a 
volatile environment with frequent but hidden changes of 
probabilistic context within one continuous trial sequence. A 
natural modeling framework for explaining the ensuing sacca- 
dic reactions is a hierarchical Bayesian learning model — 
where the subject's belief about the environment's volatility 
affects the updating of beliefs about the most likely saccade 
target location. Indeed, comparison of competing perceptual
models showed that a full hierarchical perceptual model had
higher evidence than reduced models, which assumed either that
subjects ignored prior knowledge about the volatile nature of
the environment or that they did not use it for updating
beliefs about current CV. Moreover, the optimal full hierarchical
Bayesian learning model showed higher model evidence
than a Rescorla-Wagner learning model or a model which
assumes that the subjects knew the true underlying probabilities.
Interestingly, however, the subject-specific volatility parameter
$\omega$ significantly correlated with the learning rate $\varepsilon$ of
the Rescorla-Wagner model, while no such relationship was
observed for the meta-volatility parameter $\vartheta$. The results of
the BMS as well as the relationship to the learning parameter
of a Rescorla-Wagner model could be replicated in an independent
dataset.

Hierarchical Bayesian models have been used previously to 
successfully explain various aspects of human behavior under 
uncertainty, such as binary choices (Behrens et al. 2007) or 
RTs (den Ouden et al. 2010). These studies, however, 
assumed an ideal Bayesian observer with no interindividual 
variation in the learning process per se. In contrast, we fol- 
lowed the meta-Bayesian approach of Daunizeau, den Ouden, 
Pessiglione, Kiebel, Stephan et al. (2010), Daunizeau, den 
Ouden, Pessiglione, Kiebel, Friston et al. (2010) and inferred 
subject-specific parameters of a Bayes-optimal learning 
scheme (Mathys et al. 2011) from empirical responses. Our 
results showed that there is considerable interindividual varia- 
bility, even within our group of young healthy subjects (cf. 
Figs. 6 and 8). An obvious and important extension of the
present work is to relate this variability to demographic or
neurobiological factors. In fact, the work reported here is a
prelude to future psychopharmacological and patient studies,
in which we will examine the putative relationship between
individual differences in learning and attention (as encoded
by our model parameters) and individual differences in
neuromodulatory processes (as induced by medication, aging,
or disease). In this context, the current results can be seen as
trying to establish the construct validity of our paradigm and
its modeling.

Figure 8. Illustration of the time course of $\mu_3$ (upper panels) and $s(\mu_2)$ (lower panels) during observation of $x_1$ (black diamonds) for 2 exemplary subjects with different
parameters for $\omega$ and $\vartheta$. The true %CV is depicted as a dotted line. It can be seen that subject A ($\omega = -8.09$; $\vartheta = 0.97$) shows slower updating of the probability estimate
that the target will appear at the cued location than subject B ($\omega = -2.78$; $\vartheta = 0.12$). This can be attributed to subject A's lower value of $\omega$ (reflecting the subject's belief in a
less volatile environment).

Moreover, we introduced and tested different response 
models, i.e., mappings from posterior beliefs provided by the 
perceptual model to observable behavior. These response 
models account for the individual variability in the overall 
speed of responding (Fig. 6B), but differ in assuming whether 
precision of predictions, strength of the prediction about CV 
or surprise, respectively, determine saccadic RS. Our results 
showed that model evidence was highest for the response 
model in which RS was determined by the precision of the 
prediction. 

In one sense, our findings from the Bayesian model com- 
parison — that precision was the most plausible account for RT 
benefits — should not be surprising. This is because precision 
plays the role of a rate constant in evidence accumulation 
schemes based upon predictive coding (Feldman and Friston 
2010). In other words, precision modulates the gain of predic- 
tion error in driving changes in conditional representations or 
expectations. This means that sensory channels that enjoy 
greater precision will engender faster changes in high-level 
representations and lead to more rapid perceptual conver- 
gence. Behaviorally, this should be manifest in speeded up 
RTs. Exactly the same theme is seen at higher levels of the
hierarchy (which concern slower timescales), such as inference
about the probabilistic (trial-to-trial) contingencies we
manipulated in our volatility paradigm. Here, the rate constant
corresponds to a learning rate in conventional (reinforcement 
learning) formulations. In short, sensory evidence and empiri- 
cal priors that are afforded greater precision have preferential 
access to higher levels in hierarchical inference. This is ex- 
pressed as more efficient and faster convergence in those pro- 
cessing streams — and provides a nice metaphor for attention. 



In other words, attention corresponds to optimizing esti- 
mates of precision in sensory hierarchies and is implemented 
by changing the postsynaptic gain of neuronal prediction 
error units. Hence, attention determines which part of the 
sensorium is treated as furnishing precise information. In this 
respect, this approach is perfectly congruent with spotlight or 
zoom lens theories of attention (Posner 1980; Eriksen and St 
James 1986) as well as with the biased competition model 
(Desimone and Duncan 1995): the limitation of processing 
capacities demands a selection of stimulus locations or fea- 
tures so that only the most relevant receive full attention. Neu- 
robiologically, this is likely reflected in increased synaptic 
gain and neuronal synchronization, manifesting as enhanced 
firing rates (e.g., Luck et al. 1997) or blood-oxygen-level-dependent
responses (e.g., Brefczynski and DeYoe 1999;
Kastner et al. 1999) in visual cortex, when attention is 
directed to a particular spatial location. It may also be note- 
worthy that, at the synaptic level, precision-dependent synap- 
tic gain (e.g., at superficial pyramidal cells) may be controlled 
by classical neuromodulators such as dopamine or acetylcholine
(Friston 2009). In predictive coding schemes, increased
gain boosts the sensitivity of principal cells sending
forward afferents to higher levels (such as the intraparietal 
sulcus [IPS] or the FEF), so that evidence accumulates more 
rapidly and saccades are elicited more quickly. This notion 
resonates with findings from several recent studies. For 
example, Saproo and Serences (2010) showed that spatial at- 
tention increases the mutual information of population 
responses in early visual cortex and suggested that this 
should enable higher visual areas to read out this information 
more quickly and efficiently. This is similar to the proposals 
by Feldman and Friston (2010) and in this article, where 
higher precision at lower levels induces more rapid changes 
in the activity of higher level areas. Others have suggested 
that attention produces behavioral improvements by effi- 
ciently selecting the "relevant" sensory signals (Pestilli et al. 
2011); the suggested mechanism (focusing on the magnitudes 
of signals and employing pooling operations) differs in detail
from mechanisms assumed in Feldman and Friston (a simple
modulation of postsynaptic gain) but both call upon nonlinear
(pooling and selection) mechanisms. It would be interesting
to see whether the results obtained by Pestilli et al. on
behavioral contrast-discrimination performance could be re-
plicated when trials are grouped according to precision esti-
mates. Finally, it has been shown that electrical stimulation of
direction-selective neurons in MT elicits faster perceptual
decisions due to faster evidence accumulation (Ditterich et al.
2003).

Figure 9. (A) Observed and predicted average RS in valid and invalid trials as a function of the precision-dependent attentional weight parameter $\alpha$ (attention to cued location;
calculated for the group average values). Error bars depict standard errors of the mean (SEM). The lines correspond to the predictions using the average response model
parameters, over subjects. (B) Illustration of how the observed RS costs after invalid cueing translate into RT differences (in ms).
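
The idea that a higher gain lets evidence accumulate more rapidly, so that a response is triggered earlier, can be illustrated with a deliberately simple toy accumulator. This is only a sketch under stated assumptions and not the model fitted in this study: the function name `steps_to_threshold`, the constant input stream, the threshold, and the gain values are all hypothetical.

```python
def steps_to_threshold(samples, gain, threshold=1.0):
    """Count the samples needed until gain-scaled evidence reaches the threshold.

    Returns None if the threshold is never reached.
    """
    total = 0.0
    for step, sample in enumerate(samples, start=1):
        total += gain * sample  # gain multiplies each incoming sample
        if total >= threshold:
            return step
    return None


# Hypothetical constant evidence stream (0.125 is exactly representable in binary)
evidence = [0.125] * 50

low_gain_steps = steps_to_threshold(evidence, gain=1.0)
high_gain_steps = steps_to_threshold(evidence, gain=2.0)
```

With identical input, the higher gain reaches the threshold in fewer steps, which is the qualitative relationship between precision-dependent gain and response latency described above.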

According to predictive coding implementations of hier- 
archical Bayesian inference, the gain of prediction error 
associated with bottom-up signals corresponds to the pre- 
cision of those prediction errors. Physiologically, this means 
that precision may be encoded by the gain of superficial pyramidal
cells (Brown and Friston 2012). Accordingly, our computational
model would predict that during spatial attention,
activity in hierarchically related visual areas should exhibit 
precision-dependent modulatory effects that result from the 
enhanced gain of superficial pyramidal cells. This hypothesis 
— as well as questions about where in the spatial attention/ 
saccade network precision exerts this effect — could be tested 
with DCM of electroencephalographic or magnetoencephalo- 
graphic data (Bastos et al. 2011; Brown and Friston 2012). 
Interestingly, a recent fMRI study, using a simpler DCM for 
fMRI, has highlighted the importance of the modulation of 
inhibitory self-connections in visual areas by attention and 
prediction (Kok et al. 2012). This type of modulation corre- 
sponds (phenomenologically) to a simple gain control 
mechanism that may reflect the precision-dependent modu- 
lation of pyramidal cells described above. 

Given the involvement of common areas (FEF and IPS) in 
both covert orienting of attention and overt eye
movements (Corbetta et al. 1998; Nobre et al. 2000; Perry and 
Zeki 2000; Beauchamp et al. 2001; de Haan et al. 2008), the 
psychophysical evidence for an inherent link between atten- 
tion shifts and saccade programming (Deubel and Schneider 
1996; Godijn and Theeuwes 2003; Doré-Mazars et al. 2004;
Deubel 2008), and the existence of both visual and motor 
neurons in key structures such as the FEF (e.g., Bruce and 
Goldberg 1985; Schall and Hanes 1993), it seems plausible 
that precision should affect both sensory-perceptual and
motor preparatory processes (cf. the model proposed by 
Schall et al. 2011). Hence, one could also frame the processes 
studied here in the broader context of visual-saccadic decision 
making (see Glimcher 2001, 2003 for comprehensive 
reviews). 

The focus of the present study was on explaining observed 
trialwise saccadic RS using a generative (hierarchical Baye- 
sian) model and on using model selection to disambiguate 
among different ways of updating beliefs about upcoming 
target locations in a volatile environment. While our analyses 
suggest a precision-based mechanism for spatial attention, it 
remains to be investigated where these precision estimates are 
computed within the hierarchical visual attention/saccade 
network. The present behavioral-modeling results are a foun- 
dation for future imaging studies that will exploit the across- 
trial and between-subject variation in model states and 
parameters to identify the network of regions in which precision
plays a role in belief updating in spatial attention. We
imagine that neuroimaging studies could use the time series
of the states of our perceptual model as predictor variables to 
identify their neuronal correlates (cf. Behrens et al. 2007; den 
Ouden et al. 2010). Furthermore, as mentioned above, 
subject-specific estimates of the parameters encoding individ- 
ual learning style can be used at the between-subject level to 
reveal the neuronal substrates of interindividual differences. 

Conclusion 

We have used a new formal framework for characterizing 
Bayes-optimal trial-by-trial updating of probabilistic beliefs 
under uncertainty for explaining attentional mechanisms. 
Specifically, we characterized saccadic RS during an extended 
Posner paradigm with variable CV. Comparing 11 alternative 
models, we found that empirical responses are most plausibly 
explained as a function of precision (of the beliefs about the 
causes of sensory input). This finding is consistent with 
attention theories derived from Bayesian theories of brain 
function (the free-energy principle) that equate spatial atten- 
tion to a precision-dependent gain modulation of sensory 
input. Future neuroimaging work could use the modeling 
approach introduced in this article to identify the neural and 
neurochemical basis of attentional selection and saccadic eye 
movements, in relation to probabilistic expectancies.

Supplementary material can be found at: http://www.cercor.oxfordjournals.org/.

Appendix

Update Equations of the Perceptual Model 

The variational inversion method introduced in Mathys et al. 
(2011) yields closed-form one-step update equations for the 
sufficient statistics of the posterior distributions representing 
beliefs about the hidden states x of the agent's environment. 
In the specific perceptual model depicted in Figure 2, state $x_1$
is observed, whereas $x_2$ and $x_3$ remain hidden. As posteriors
are assumed to be Gaussian, the relevant sufficient statistics
are the means $\mu_2$, $\mu_3$ and precisions (inverse variances)
$\pi_2$, $\pi_3$ of the distributions for $x_2$ and $x_3$. It turns
out that the updates of the means take the form of
precision-weighted prediction errors:

[equations (A1) and (A2) not recoverable from the source]

where for clarity and to reveal the conceptual meaning of
these equations, we make use of the following definitions:

[equations (A3) to (A5) not recoverable from the source]

$$\hat{\pi}_1^{(t)} \stackrel{\mathrm{def}}{=} \frac{1}{\hat{\mu}_1^{(t)} \left(1 - \hat{\mu}_1^{(t)}\right)} \qquad \text{(A6)}$$

[equation (A7) not recoverable from the source]

$v_2^{(t)}$ is the agent's estimate of the variance of the random walk
in $x_2$ before receiving input $x_1^{(t)}$; $\hat{\pi}_i^{(t)}$ are the
precisions of the predictions about states $x_i^{(t)}$ and
$\delta_i^{(t)}$ are the prediction errors. Note that $\delta_2^{(t)}$ is a
prediction error referring not to the value of $x_2$ but to its log
volatility; therefore, it is determined by the ratio of observed to
predicted total variance in $x_2$.

It is obvious from equations (A1) and (A2) that updates are
always proportional to the prediction error about the input
from the level below $\delta_{i-1}$ and to the precision
$\hat{\pi}_{i-1}$ of the prediction about the state at the level below.

The update equations for the precisions $\pi_i$ are

[equations not recoverable from the source]

with

[equation (A10) not recoverable from the source]

For the derivation of these equations, see Mathys et al.
(2011).

Response Models

In the following, we explain the functional form of our 3
response models in more detail. All models assume a linear
relationship between $\alpha$ and RS, parameterized by the 2
parameters $\zeta_1$ and $\zeta_2$ (cf. eq. 5 and Fig. 3). $\alpha$
represents the proportion of total attentional capacity that is
allocated to the cued location (and therefore lies in the unit
interval) and should amount to 0.5 if both target locations are
equally likely. These constraints, which all response models conform
to, can be summarized as:

C1: $0 < \alpha < 1$,

C2: $\alpha = 0.5$ for $\hat{\mu}_1 = 0.5$.

Given these constraints, our response models differ in which
attribute of the predicted validity of the cue maps to the
attentional factor $\alpha$ (and thus determines RS in eq. 5). The
functional forms of these models are motivated in the following
and are depicted graphically in Figure 4. (Note that the vertical
axis in Fig. 4 is attention to outcome location. For valid
trials, this is equal to attention to cued location $\alpha$, while for
invalid trials it is $1 - \alpha$.)

The "precision" model (eq. 6) links attention to the precision
of predictions as suggested by Feldman and Friston
(2010). In our specific case, the precision of the prediction at
the first level ($\hat{\pi}_1$) has a minimal value of 4 when
$\hat{\mu}_1 = 0.5$ and approaches infinity as $\hat{\mu}_1$
approaches 1 (cf. eq. A6). The most parsimonious way to meet the
above constraints C1 and C2 is to define $\alpha$ as the logistic
sigmoid of $\hat{\pi}_1$ minus its minimum (cf. eq. 6):

$$\alpha^{(t)} = s\left(\hat{\pi}_1^{(t)} - 4\right). \qquad \text{(A11)}$$

Note that as the cue becomes a counter indication of outcome
location when $\mu_2$ falls below 0 (or equivalently, when
$\hat{\mu}_1$ drops below 0.5), a suitable definition of $\alpha$ for
the whole range of $\hat{\mu}_1$ is

$$\alpha^{(t)} = s\left(\mathrm{sign}\left(\mu_2^{(t-1)}\right) \left(\hat{\pi}_1^{(t)} - 4\right)\right). \qquad \text{(A12)}$$

This ensures that attention to the cued location falls to 0 as
$\hat{\mu}_1$ approaches 0.

A simpler model of attention allocation given a cue-induced
belief about outcome is that attention is proportional to predicted
probability of outcome: if the agent believes that the
probability of seeing outcome "left" is P (e.g., 80%), then it
will allocate proportion P (i.e., 80%) of its attentional resources
to location left. We call this the belief model (cf. eq.
7). In terms of our perceptual model, the predicted probability
of a valid trial is simply $\hat{\mu}_1$:

$$\alpha^{(t)} = \hat{\mu}_1^{(t)}. \qquad \text{(A13)}$$

Finally, the surprise model describes the attentional factor $\alpha$
as a function of the Shannon surprise (the negative logarithm
of the probability of the outcome being the cued location
given the prediction $\hat{\mu}_1$) (cf. Bestmann et al. 2008). For a
predicted probability $\hat{\mu}_1 = 1$ of a valid trial, surprise is zero,
whereas for $\hat{\mu}_1 = 0$, it is infinite. In the first case, attention is
therefore allocated exclusively to the cued location (i.e.,
$\alpha = 1$), whereas in the second case, attention is allocated
exclusively to the noncued location (i.e., $\alpha = 0$). The simplest
way this can be achieved under consideration of constraints
C1 and C2 is (cf. eq. 8):

$$\alpha^{(t)} = \frac{1}{1 + \mathrm{surprise}\left(\hat{\mu}_1^{(t)}\right)}, \qquad \text{(A14)}$$

where $\mathrm{surprise}(\hat{\mu}_1^{(t)}) = -\log_2 p(x_1^{(t)} = 1) = -\log_2(\hat{\mu}_1^{(t)})$.

Note that we make use of $\log_2 x = \ln x / \ln 2$ to ensure we
meet constraint C2.
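
The two parts of this appendix can be tied together in a short sketch: a one-trial belief update followed by the three mappings from belief to the attentional factor. The update step follows the closed-form equations for the three-level binary model in Mathys et al. (2011), rewritten here in terms of precisions; it is an assumption that this matches the notation and numbering of (A1) to (A10). The function names, the parameter names `kappa`, `omega`, and `theta`, and all numerical values are hypothetical choices for illustration, not fitted estimates.

```python
import math


def sigmoid(x):
    """Numerically stable logistic sigmoid s(x)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def predict(mu2):
    """First-level prediction and its precision, given the belief mu2.

    Assumes moderate |mu2|; for extreme values mu1_hat rounds to 0 or 1
    and the precision is no longer finite.
    """
    mu1_hat = sigmoid(mu2)                        # predicted probability of a valid trial
    pi1_hat = 1.0 / (mu1_hat * (1.0 - mu1_hat))   # equals 4 when mu1_hat = 0.5
    return mu1_hat, pi1_hat


def hgf_update(u, mu2, pi2, mu3, pi3, kappa, omega, theta):
    """One-trial update after observing u (1 = valid, 0 = invalid).

    Assumed form, after Mathys et al. (2011), in precision notation.
    """
    mu1_hat, pi1_hat = predict(mu2)
    v2 = math.exp(kappa * mu3 + omega)    # predicted step variance of x2
    pi2_hat = 1.0 / (1.0 / pi2 + v2)      # precision of the prediction about x2
    delta1 = u - mu1_hat                  # first-level prediction error
    pi2_new = pi2_hat + 1.0 / pi1_hat
    mu2_new = mu2 + delta1 / pi2_new      # precision-weighted prediction error
    # Ratio of observed to predicted total variance in x2, minus 1
    delta2 = (1.0 / pi2_new + (mu2_new - mu2) ** 2) * pi2_hat - 1.0
    pi3_hat = 1.0 / (1.0 / pi3 + theta)
    w2 = v2 * pi2_hat
    r2 = (v2 - 1.0 / pi2) * pi2_hat
    pi3_new = pi3_hat + 0.5 * kappa ** 2 * w2 * (w2 + r2 * delta2)
    mu3_new = mu3 + 0.5 * kappa * w2 * delta2 / pi3_new
    return mu2_new, pi2_new, mu3_new, pi3_new


def alpha_precision(mu2):
    """Precision model: sigmoid of the signed first-level precision minus 4."""
    _, pi1_hat = predict(mu2)
    sign = (mu2 > 0) - (mu2 < 0)
    return sigmoid(sign * (pi1_hat - 4.0))


def alpha_belief(mu2):
    """Belief model: attention equals the predicted probability of a valid trial."""
    return predict(mu2)[0]


def alpha_surprise(mu2):
    """Surprise model: 1 / (1 + Shannon surprise in bits)."""
    mu1_hat, _ = predict(mu2)
    return 1.0 / (1.0 - math.log2(mu1_hat))


# Hypothetical starting beliefs and parameters, followed by three valid trials
state = (0.0, 1.0, 0.0, 1.0)  # mu2, pi2, mu3, pi3
trace = []
for u in (1, 1, 1):
    state = hgf_update(u, *state, kappa=1.0, omega=-2.0, theta=0.5)
    mu2 = state[0]
    trace.append((alpha_precision(mu2), alpha_belief(mu2), alpha_surprise(mu2)))
```

At a flat belief (`mu2 = 0`) all three mappings return 0.5, as required by constraint C2. After a run of valid trials they all rise above 0.5 but at different rates, which is what allows model comparison to distinguish between them.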
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